Owner: Daniel Soukup - Created: 2025.11.01
The goal of this notebook is to explore the dataset, understand our target column and features (statistical properties, data quality) and plan for preprocessing and modelling.
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.offline as pyo
pyo.init_notebook_mode()
Ensure the datasets are available:
client = dataiku.api_client()
project = client.get_project('US_CENSUS_PROJECT')
datasets = project.list_datasets()
for dataset in datasets:
print(dataset["name"])
census_income_learn census_income_test
Load the datasets:
def load_data_by_name(name: str) -> pd.DataFrame:
"""
Load dataset by its name.
"""
mydataset = dataiku.Dataset(name)
mydataset_df = mydataset.get_dataframe()
return mydataset_df
train_df = load_data_by_name("census_income_learn")
test_df = load_data_by_name("census_income_test")
train_df.shape, test_df.shape
((199523, 42), (99762, 42))
We have ~200K rows for training and ~100K rows for testing, each with 42 columns, the last being our target. Note that we are missing the column names.
train_df.head()
| col_0 | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 | col_10 | col_11 | col_12 | col_13 | col_14 | col_15 | col_16 | col_17 | col_18 | col_19 | col_20 | col_21 | col_22 | col_23 | col_24 | col_25 | col_26 | col_27 | col_28 | col_29 | col_30 | col_31 | col_32 | col_33 | col_34 | col_35 | col_36 | col_37 | col_38 | col_39 | col_40 | col_41 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73 | Not in universe | 0 | 0 | High school graduate | 0 | Not in universe | Widowed | Not in universe or children | Not in universe | White | All other | Female | Not in universe | Not in universe | Not in labor force | 0 | 0 | 0 | Nonfiler | Not in universe | Not in universe | Other Rel 18+ ever marr not in subfamily | Other relative of householder | 1700.09 | ? | ? | ? | Not in universe under 1 year old | ? | 0 | Not in universe | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 0 | 95 | - 50000. |
| 1 | 58 | Self-employed-not incorporated | 4 | 34 | Some college but no degree | 0 | Not in universe | Divorced | Construction | Precision production craft & repair | White | All other | Male | Not in universe | Not in universe | Children or Armed Forces | 0 | 0 | 0 | Head of household | South | Arkansas | Householder | Householder | 1053.55 | MSA to MSA | Same county | Same county | No | Yes | 1 | Not in universe | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 52 | 94 | - 50000. |
| 2 | 18 | Not in universe | 0 | 0 | 10th grade | 0 | High school | Never married | Not in universe or children | Not in universe | Asian or Pacific Islander | All other | Female | Not in universe | Not in universe | Not in labor force | 0 | 0 | 0 | Nonfiler | Not in universe | Not in universe | Child 18+ never marr Not in a subfamily | Child 18 or older | 991.95 | ? | ? | ? | Not in universe under 1 year old | ? | 0 | Not in universe | Vietnam | Vietnam | Vietnam | Foreign born- Not a citizen of U S | 0 | Not in universe | 2 | 0 | 95 | - 50000. |
| 3 | 9 | Not in universe | 0 | 0 | Children | 0 | Not in universe | Never married | Not in universe or children | Not in universe | White | All other | Female | Not in universe | Not in universe | Children or Armed Forces | 0 | 0 | 0 | Nonfiler | Not in universe | Not in universe | Child <18 never marr not in subfamily | Child under 18 never married | 1758.14 | Nonmover | Nonmover | Nonmover | Yes | Not in universe | 0 | Both parents present | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 0 | 0 | 94 | - 50000. |
| 4 | 10 | Not in universe | 0 | 0 | Children | 0 | Not in universe | Never married | Not in universe or children | Not in universe | White | All other | Female | Not in universe | Not in universe | Children or Armed Forces | 0 | 0 | 0 | Nonfiler | Not in universe | Not in universe | Child <18 never marr not in subfamily | Child under 18 never married | 1069.16 | Nonmover | Nonmover | Nonmover | Yes | Not in universe | 0 | Both parents present | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 0 | 0 | 94 | - 50000. |
test_df.head()
| col_0 | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 | col_10 | col_11 | col_12 | col_13 | col_14 | col_15 | col_16 | col_17 | col_18 | col_19 | col_20 | col_21 | col_22 | col_23 | col_24 | col_25 | col_26 | col_27 | col_28 | col_29 | col_30 | col_31 | col_32 | col_33 | col_34 | col_35 | col_36 | col_37 | col_38 | col_39 | col_40 | col_41 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38 | Private | 6 | 36 | 1st 2nd 3rd or 4th grade | 0 | Not in universe | Married-civilian spouse present | Manufacturing-durable goods | Machine operators assmblrs & inspctrs | White | Mexican (Mexicano) | Female | Not in universe | Not in universe | Full-time schedules | 0 | 0 | 0 | Joint one under 65 & one 65+ | Not in universe | Not in universe | Spouse of householder | Spouse of householder | 1032.38 | ? | ? | ? | Not in universe under 1 year old | ? | 4 | Not in universe | Mexico | Mexico | Mexico | Foreign born- Not a citizen of U S | 0 | Not in universe | 2 | 12 | 95 | - 50000. |
| 1 | 44 | Self-employed-not incorporated | 37 | 12 | Associates degree-occup /vocational | 0 | Not in universe | Married-civilian spouse present | Business and repair services | Professional specialty | White | All other | Female | Not in universe | Not in universe | PT for econ reasons usually PT | 0 | 0 | 2500 | Joint both under 65 | Not in universe | Not in universe | Spouse of householder | Spouse of householder | 1462.33 | ? | ? | ? | Not in universe under 1 year old | ? | 1 | Not in universe | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 26 | 95 | - 50000. |
| 2 | 2 | Not in universe | 0 | 0 | Children | 0 | Not in universe | Never married | Not in universe or children | Not in universe | White | Mexican-American | Male | Not in universe | Not in universe | Children or Armed Forces | 0 | 0 | 0 | Nonfiler | Not in universe | Not in universe | Child <18 never marr not in subfamily | Child under 18 never married | 1601.75 | ? | ? | ? | Not in universe under 1 year old | ? | 0 | Both parents present | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 0 | 0 | 95 | - 50000. |
| 3 | 35 | Private | 29 | 3 | High school graduate | 0 | Not in universe | Divorced | Transportation | Executive admin and managerial | White | All other | Female | Not in universe | Not in universe | Children or Armed Forces | 0 | 0 | 0 | Head of household | Not in universe | Not in universe | Householder | Householder | 1866.88 | Nonmover | Nonmover | Nonmover | Yes | Not in universe | 5 | Not in universe | United-States | United-States | United-States | Native- Born in the United States | 2 | Not in universe | 2 | 52 | 94 | - 50000. |
| 4 | 49 | Private | 4 | 34 | High school graduate | 0 | Not in universe | Divorced | Construction | Precision production craft & repair | White | All other | Male | Not in universe | Not in universe | Full-time schedules | 0 | 0 | 0 | Single | Not in universe | Not in universe | Secondary individual | Nonrelative of householder | 1394.54 | ? | ? | ? | Not in universe under 1 year old | ? | 4 | Not in universe | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 50 | 95 | - 50000. |
train_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 199523 entries, 0 to 199522 Data columns (total 42 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 col_0 199523 non-null int64 1 col_1 199523 non-null object 2 col_2 199523 non-null int64 3 col_3 199523 non-null int64 4 col_4 199523 non-null object 5 col_5 199523 non-null int64 6 col_6 199523 non-null object 7 col_7 199523 non-null object 8 col_8 199523 non-null object 9 col_9 199523 non-null object 10 col_10 199523 non-null object 11 col_11 198649 non-null object 12 col_12 199523 non-null object 13 col_13 199523 non-null object 14 col_14 199523 non-null object 15 col_15 199523 non-null object 16 col_16 199523 non-null int64 17 col_17 199523 non-null int64 18 col_18 199523 non-null int64 19 col_19 199523 non-null object 20 col_20 199523 non-null object 21 col_21 199523 non-null object 22 col_22 199523 non-null object 23 col_23 199523 non-null object 24 col_24 199523 non-null float64 25 col_25 199523 non-null object 26 col_26 199523 non-null object 27 col_27 199523 non-null object 28 col_28 199523 non-null object 29 col_29 199523 non-null object 30 col_30 199523 non-null int64 31 col_31 199523 non-null object 32 col_32 199523 non-null object 33 col_33 199523 non-null object 34 col_34 199523 non-null object 35 col_35 199523 non-null object 36 col_36 199523 non-null int64 37 col_37 199523 non-null object 38 col_38 199523 non-null int64 39 col_39 199523 non-null int64 40 col_40 199523 non-null int64 41 col_41 199523 non-null object dtypes: float64(1), int64(12), object(29) memory usage: 63.9+ MB
Observations:
It is clear that we need to do a fair bit of cleaning:
Excerpt from the data dict:
A quick check confirms that there are no exact duplicate columns per se:
train_df.T.duplicated().sum()
0
Let's look at the high-level stats first:
# numeric columns
train_df.describe()
| col_0 | col_2 | col_3 | col_5 | col_16 | col_17 | col_18 | col_24 | col_30 | col_36 | col_38 | col_39 | col_40 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 199523.000000 | 199523.000000 | 199523.000000 | 199523.000000 | 199523.00000 | 199523.000000 | 199523.000000 | 199523.000000 | 199523.000000 | 199523.000000 | 199523.000000 | 199523.000000 | 199523.000000 |
| mean | 34.494199 | 15.352320 | 11.306556 | 55.426908 | 434.71899 | 37.313788 | 197.529533 | 1740.380269 | 1.956180 | 0.175438 | 1.514833 | 23.174897 | 94.499672 |
| std | 22.310895 | 18.067129 | 14.454204 | 274.896454 | 4697.53128 | 271.896428 | 1984.163658 | 993.768156 | 2.365126 | 0.553694 | 0.851473 | 24.411488 | 0.500001 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 37.870000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 94.000000 |
| 25% | 15.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 1061.615000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 94.000000 |
| 50% | 33.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 1618.310000 | 1.000000 | 0.000000 | 2.000000 | 8.000000 | 94.000000 |
| 75% | 50.000000 | 33.000000 | 26.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 2188.610000 | 4.000000 | 0.000000 | 2.000000 | 52.000000 | 95.000000 |
| max | 90.000000 | 51.000000 | 46.000000 | 9999.000000 | 99999.00000 | 4608.000000 | 99999.000000 | 18656.300000 | 6.000000 | 2.000000 | 2.000000 | 52.000000 | 95.000000 |
train_df.select_dtypes("object").describe()
| col_1 | col_4 | col_6 | col_7 | col_8 | col_9 | col_10 | col_11 | col_12 | col_13 | col_14 | col_15 | col_19 | col_20 | col_21 | col_22 | col_23 | col_25 | col_26 | col_27 | col_28 | col_29 | col_31 | col_32 | col_33 | col_34 | col_35 | col_37 | col_41 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 198649 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 | 199523 |
| unique | 9 | 17 | 3 | 7 | 24 | 15 | 5 | 9 | 2 | 3 | 6 | 8 | 6 | 6 | 51 | 38 | 8 | 10 | 9 | 10 | 3 | 4 | 5 | 43 | 43 | 43 | 5 | 3 | 2 |
| top | Not in universe | High school graduate | Not in universe | Never married | Not in universe or children | Not in universe | White | All other | Female | Not in universe | Not in universe | Children or Armed Forces | Nonfiler | Not in universe | Not in universe | Householder | Householder | ? | ? | ? | Not in universe under 1 year old | ? | Not in universe | United-States | United-States | United-States | Native- Born in the United States | Not in universe | - 50000. |
| freq | 100245 | 48407 | 186943 | 86485 | 100684 | 100684 | 167365 | 171907 | 103984 | 180459 | 193453 | 123769 | 75094 | 183750 | 183750 | 53248 | 75475 | 99696 | 99696 | 99696 | 101212 | 99696 | 144232 | 159163 | 160479 | 176989 | 176992 | 197539 | 187141 |
We'll use the data dictionary and number of unique values to map our columns to their proper names. We also need this information to make well-informed decisions on how the columns should be processed for modelling.
pd.set_option('display.max_colwidth', 100)
unique_values = pd.DataFrame(
{
"unique_values": [train_df[col].unique() for col in train_df.columns],
"num_unique": [train_df[col].nunique() for col in train_df.columns],
"dtype": [train_df[col].dtype for col in train_df.columns],
},
index=train_df.columns,
)
unique_values
| unique_values | num_unique | dtype | |
|---|---|---|---|
| col_0 | [73, 58, 18, 9, 10, 48, 42, 28, 47, 34, 8, 32, 51, 46, 26, 13, 39, 16, 35, 12, 27, 56, 55, 2, 1,... | 91 | int64 |
| col_1 | [Not in universe, Self-employed-not incorporated, Private, Local government, Federal government,... | 9 | object |
| col_2 | [0, 4, 40, 34, 43, 37, 24, 39, 12, 35, 45, 3, 19, 29, 32, 48, 33, 23, 44, 36, 31, 30, 41, 5, 11,... | 52 | int64 |
| col_3 | [0, 34, 10, 3, 40, 26, 37, 31, 12, 36, 41, 22, 2, 35, 25, 23, 42, 8, 19, 29, 27, 16, 33, 13, 18,... | 47 | int64 |
| col_4 | [High school graduate, Some college but no degree, 10th grade, Children, Bachelors degree(BA AB ... | 17 | object |
| col_5 | [0, 1200, 876, 500, 400, 450, 650, 475, 1100, 560, 1850, 830, 1925, 987, 2825, 709, 775, 5000, 6... | 1240 | int64 |
| col_6 | [Not in universe, High school, College or university] | 3 | object |
| col_7 | [Widowed, Divorced, Never married, Married-civilian spouse present, Separated, Married-spouse ab... | 7 | object |
| col_8 | [Not in universe or children, Construction, Entertainment, Finance insurance and real estate, Ed... | 24 | object |
| col_9 | [Not in universe, Precision production craft & repair, Professional specialty, Executive admin a... | 15 | object |
| col_10 | [White, Asian or Pacific Islander, Amer Indian Aleut or Eskimo, Black, Other] | 5 | object |
| col_11 | [All other, Do not know, Central or South American, Mexican (Mexicano), Mexican-American, Other ... | 9 | object |
| col_12 | [Female, Male] | 2 | object |
| col_13 | [Not in universe, No, Yes] | 3 | object |
| col_14 | [Not in universe, Job loser - on layoff, Other job loser, New entrant, Re-entrant, Job leaver] | 6 | object |
| col_15 | [Not in labor force, Children or Armed Forces, Full-time schedules, Unemployed full-time, Unempl... | 8 | object |
| col_16 | [0, 5178, 991, 2829, 3464, 5556, 7298, 15024, 1831, 3137, 10605, 20051, 2538, 3908, 2407, 2050, ... | 132 | int64 |
| col_17 | [0, 1590, 1977, 1669, 1719, 2444, 1421, 1848, 2205, 2149, 2001, 1902, 2090, 1573, 2415, 2377, 18... | 113 | int64 |
| col_18 | [0, 6000, 100, 700, 1000, 49, 59, 50, 10000, 275, 11744, 3095, 800, 16, 4, 70, 33000, 7, 2100, 4... | 1478 | int64 |
| col_19 | [Nonfiler, Head of household, Joint both under 65, Single, Joint both 65+, Joint one under 65 & ... | 6 | object |
| col_20 | [Not in universe, South, Northeast, Midwest, West, Abroad] | 6 | object |
| col_21 | [Not in universe, Arkansas, Utah, Michigan, Minnesota, Alaska, Kansas, Indiana, ?, Massachusetts... | 51 | object |
| col_22 | [Other Rel 18+ ever marr not in subfamily, Householder, Child 18+ never marr Not in a subfamily,... | 38 | object |
| col_23 | [Other relative of householder, Householder, Child 18 or older, Child under 18 never married, Sp... | 8 | object |
| col_24 | [1700.09, 1053.55, 991.95, 1758.14, 1069.16, 162.61, 1535.86, 898.83, 1661.53, 1146.79, 2466.24,... | 99800 | float64 |
| col_25 | [?, MSA to MSA, Nonmover, NonMSA to nonMSA, Not in universe, Not identifiable, Abroad to MSA, MS... | 10 | object |
| col_26 | [?, Same county, Nonmover, Different region, Different county same state, Not in universe, Diffe... | 9 | object |
| col_27 | [?, Same county, Nonmover, Different state in South, Different county same state, Not in univers... | 10 | object |
| col_28 | [Not in universe under 1 year old, No, Yes] | 3 | object |
| col_29 | [?, Yes, Not in universe, No] | 4 | object |
| col_30 | [0, 1, 6, 4, 5, 3, 2] | 7 | int64 |
| col_31 | [Not in universe, Both parents present, Mother only present, Neither parent present, Father only... | 5 | object |
| col_32 | [United-States, Vietnam, Philippines, ?, Columbia, Germany, Mexico, Japan, Peru, Dominican-Repub... | 43 | object |
| col_33 | [United-States, Vietnam, ?, Columbia, Mexico, El-Salvador, Peru, Puerto-Rico, Cuba, Philippines,... | 43 | object |
| col_34 | [United-States, Vietnam, ?, Columbia, Mexico, Peru, Cuba, Philippines, Dominican-Republic, El-Sa... | 43 | object |
| col_35 | [Native- Born in the United States, Foreign born- Not a citizen of U S, Foreign born- U S citize... | 5 | object |
| col_36 | [0, 2, 1] | 3 | int64 |
| col_37 | [Not in universe, No, Yes] | 3 | object |
| col_38 | [2, 0, 1] | 3 | int64 |
| col_39 | [0, 52, 30, 49, 32, 15, 38, 48, 9, 24, 50, 10, 45, 43, 4, 26, 40, 20, 6, 12, 51, 1, 8, 39, 13, 1... | 53 | int64 |
| col_40 | [95, 94] | 2 | int64 |
| col_41 | [- 50000., 50000+.] | 2 | object |
Based on this we put together our mapping:
col_mapping = {
"col_0": "age", # matches type and range
"col_1": "class of worker", # unique values checked with data dict (UVDD)
"col_2": "detailed industry recode", # UVDD
"col_3": "detailed occupation recode", # UVDD
"col_4": "education", # UVDD
"col_5": "wage per hour", # looks to be at right position, type checks, in cents?
"col_6": "enroll in edu inst last wk", # UVDD
"col_7": "marital stat", # UVDD
"col_8": "major industry code", # UVDD
"col_9": "major occupation code", # UVDD
"col_10": "race", # UVDD
"col_11": "hispanic origin", # UVDD - 10 unique in data dict? values match though
"col_12": "sex", # UVDD
"col_13": "member of a labor union", # UVDD
"col_14": "reason for unemployment", # UVDD
"col_15": "full or part time employment stat", # UVDD
"col_16": "capital gains", # data dict check, range ok, dollars?
"col_17": "capital losses", # data dict check, range ok, dollars?
"col_18": "dividends from stocks", # data dict check
"col_19": "tax filer stat", # UVDD
"col_20": "region of previous residence", # UVDD
"col_21": "state of previous residence", # UVDD
"col_22": "detailed household and family stat", # data dict check
"col_23": "detailed household summary in household", # data dict check
"col_24": "instance weight", # SPECIAL
"col_25": "migration code-change in msa", # UVDD
"col_26": "migration code-change in reg", # UVDD
"col_27": "migration code-move within reg", # UVDD
"col_28": "live in this house 1 year ago",# UVDD
"col_29": "migration prev res in sunbelt", # UVDD
"col_30": "num persons worked for employer", # value check
"col_31": "family members under 18", # UVDD
"col_32": "country of birth mother", # UVDD
"col_33": "country of birth self", # UVDD
"col_34": "country of birth father", # UVDD
"col_35": "citizenship", # UVDD
"col_36": "own business or self employed", # UVDD
"col_37": "fill inc questionnaire for veteran's admin", # UVDD
"col_38": "veterans benefits", # UVDD
"col_39": "weeks worked in year", # data dict order
"col_40": "year", # UVDD
"col_41": "income"
}
About the instance weight:
The instance weight indicates the number of people in the population that each record represents due to stratified sampling. To do real analysis and derive conclusions, this field must be used. This attribute should not be used in the classifiers, so it is set to "ignore" in this file.
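To make this concrete, here is a minimal sketch on a toy frame (hypothetical values) of keeping the weight out of the feature matrix while still using it for population-level estimates:

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical values).
df = pd.DataFrame({
    "age": [25, 40, 60],
    "instance weight": [1500.0, 900.0, 2100.0],
    "income": ["- 50000.", "50000+.", "- 50000."],
})

# Exclude the weight (and the target) from the model's feature matrix...
X = df.drop(columns=["instance weight", "income"])

# ...but use it for population-level estimates, e.g. the weighted share
# of the population that the high-income records represent.
w = df["instance weight"]
weighted_share = w[df["income"] == "50000+."].sum() / w.sum()
```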
We are ready to map the names and look into the mystery column:
train_df = train_df.rename(columns=col_mapping)
test_df = test_df.rename(columns=col_mapping)
train_df.head()
| age | class of worker | detailed industry recode | detailed occupation recode | education | wage per hour | enroll in edu inst last wk | marital stat | major industry code | major occupation code | race | hispanic origin | sex | member of a labor union | reason for unemployment | full or part time employment stat | capital gains | capital losses | dividends from stocks | tax filer stat | region of previous residence | state of previous residence | detailed household and family stat | detailed household summary in household | instance weight | migration code-change in msa | migration code-change in reg | migration code-move within reg | live in this house 1 year ago | migration prev res in sunbelt | num persons worked for employer | family members under 18 | country of birth mother | country of birth self | country of birth father | citizenship | own business or self employed | fill inc questionnaire for veteran's admin | veterans benefits | weeks worked in year | year | income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73 | Not in universe | 0 | 0 | High school graduate | 0 | Not in universe | Widowed | Not in universe or children | Not in universe | White | All other | Female | Not in universe | Not in universe | Not in labor force | 0 | 0 | 0 | Nonfiler | Not in universe | Not in universe | Other Rel 18+ ever marr not in subfamily | Other relative of householder | 1700.09 | ? | ? | ? | Not in universe under 1 year old | ? | 0 | Not in universe | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 0 | 95 | - 50000. |
| 1 | 58 | Self-employed-not incorporated | 4 | 34 | Some college but no degree | 0 | Not in universe | Divorced | Construction | Precision production craft & repair | White | All other | Male | Not in universe | Not in universe | Children or Armed Forces | 0 | 0 | 0 | Head of household | South | Arkansas | Householder | Householder | 1053.55 | MSA to MSA | Same county | Same county | No | Yes | 1 | Not in universe | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 52 | 94 | - 50000. |
| 2 | 18 | Not in universe | 0 | 0 | 10th grade | 0 | High school | Never married | Not in universe or children | Not in universe | Asian or Pacific Islander | All other | Female | Not in universe | Not in universe | Not in labor force | 0 | 0 | 0 | Nonfiler | Not in universe | Not in universe | Child 18+ never marr Not in a subfamily | Child 18 or older | 991.95 | ? | ? | ? | Not in universe under 1 year old | ? | 0 | Not in universe | Vietnam | Vietnam | Vietnam | Foreign born- Not a citizen of U S | 0 | Not in universe | 2 | 0 | 95 | - 50000. |
| 3 | 9 | Not in universe | 0 | 0 | Children | 0 | Not in universe | Never married | Not in universe or children | Not in universe | White | All other | Female | Not in universe | Not in universe | Children or Armed Forces | 0 | 0 | 0 | Nonfiler | Not in universe | Not in universe | Child <18 never marr not in subfamily | Child under 18 never married | 1758.14 | Nonmover | Nonmover | Nonmover | Yes | Not in universe | 0 | Both parents present | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 0 | 0 | 94 | - 50000. |
| 4 | 10 | Not in universe | 0 | 0 | Children | 0 | Not in universe | Never married | Not in universe or children | Not in universe | White | All other | Female | Not in universe | Not in universe | Children or Armed Forces | 0 | 0 | 0 | Nonfiler | Not in universe | Not in universe | Child <18 never marr not in subfamily | Child under 18 never married | 1069.16 | Nonmover | Nonmover | Nonmover | Yes | Not in universe | 0 | Both parents present | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 0 | 0 | 94 | - 50000. |
We now look at each column's individual distribution; this will help us decide on any preprocessing.
First, the numeric fields:
train_df.select_dtypes(['int', 'float'])
| age | detailed industry recode | detailed occupation recode | wage per hour | capital gains | capital losses | dividends from stocks | instance weight | num persons worked for employer | own business or self employed | veterans benefits | weeks worked in year | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73 | 0 | 0 | 0 | 0 | 0 | 0 | 1700.09 | 0 | 0 | 2 | 0 | 95 |
| 1 | 58 | 4 | 34 | 0 | 0 | 0 | 0 | 1053.55 | 1 | 0 | 2 | 52 | 94 |
| 2 | 18 | 0 | 0 | 0 | 0 | 0 | 0 | 991.95 | 0 | 0 | 2 | 0 | 95 |
| 3 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 1758.14 | 0 | 0 | 0 | 0 | 94 |
| 4 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 1069.16 | 0 | 0 | 0 | 0 | 94 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 199518 | 87 | 0 | 0 | 0 | 0 | 0 | 0 | 955.27 | 0 | 0 | 2 | 0 | 95 |
| 199519 | 65 | 37 | 2 | 0 | 6418 | 0 | 9 | 687.19 | 1 | 0 | 2 | 52 | 94 |
| 199520 | 47 | 0 | 0 | 0 | 0 | 0 | 157 | 1923.03 | 6 | 0 | 2 | 52 | 95 |
| 199521 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 4664.87 | 0 | 0 | 2 | 0 | 95 |
| 199522 | 32 | 42 | 30 | 0 | 0 | 0 | 0 | 1830.11 | 6 | 0 | 2 | 52 | 94 |
199523 rows × 13 columns
num_only_df = train_df.select_dtypes(["int", "float"])
unique_values = pd.DataFrame(
{
"num_unique": [num_only_df[col].nunique() for col in num_only_df.columns],
"dtype": [num_only_df[col].dtype for col in num_only_df.columns],
},
index=num_only_df.columns,
)
unique_values
| num_unique | dtype | |
|---|---|---|
| age | 91 | int64 |
| detailed industry recode | 52 | int64 |
| detailed occupation recode | 47 | int64 |
| wage per hour | 1240 | int64 |
| capital gains | 132 | int64 |
| capital losses | 113 | int64 |
| dividends from stocks | 1478 | int64 |
| instance weight | 99800 | float64 |
| num persons worked for employer | 7 | int64 |
| own business or self employed | 3 | int64 |
| veterans benefits | 3 | int64 |
| weeks worked in year | 53 | int64 |
| year | 2 | int64 |
There are a few fields we should treat as categorical instead, since their numbers encode independent codes with no ordinal relationship: veterans benefits, own business or self employed, and the two detailed recodes.
One might consider treating "num persons worked for employer" and "year" the same way, given that each takes only a handful of unique values.
train_df["num persons worked for employer"].value_counts(normalize=True)
num persons worked for employer 0 0.481062 6 0.182991 1 0.115821 4 0.072067 3 0.067285 2 0.050526 5 0.030247 Name: proportion, dtype: float64
train_df["year"].value_counts(normalize=True)
year 94 0.500328 95 0.499672 Name: proportion, dtype: float64
Let's convert these and process them together with the rest of the categorical columns:
num_to_object_cols = [
"detailed industry recode",
"detailed occupation recode",
"own business or self employed",
"veterans benefits",
"year",
"num persons worked for employer",
]
for col in num_to_object_cols:
    train_df[col] = train_df[col].astype("object")
    test_df[col] = test_df[col].astype("object")
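As an aside (a stylistic alternative, not required here): pandas' dedicated "category" dtype conveys the same intent while using less memory than plain object columns. A toy sketch:

```python
import pandas as pd

# Toy stand-in with two of the code-valued columns.
df = pd.DataFrame({"veterans benefits": [2, 0, 1, 2], "year": [94, 95, 94, 95]})

# "category" marks the codes as labels, not numbers, and stores them compactly.
for col in ["veterans benefits", "year"]:
    df[col] = df[col].astype("category")
```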
Look at high level stats:
num_cols = train_df.select_dtypes('number')
num_cols.describe()
| age | wage per hour | capital gains | capital losses | dividends from stocks | instance weight | weeks worked in year | |
|---|---|---|---|---|---|---|---|
| count | 199523.000000 | 199523.000000 | 199523.00000 | 199523.000000 | 199523.000000 | 199523.000000 | 199523.000000 |
| mean | 34.494199 | 55.426908 | 434.71899 | 37.313788 | 197.529533 | 1740.380269 | 23.174897 |
| std | 22.310895 | 274.896454 | 4697.53128 | 271.896428 | 1984.163658 | 993.768156 | 24.411488 |
| min | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 37.870000 | 0.000000 |
| 25% | 15.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 1061.615000 | 0.000000 |
| 50% | 33.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 1618.310000 | 8.000000 |
| 75% | 50.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 2188.610000 | 52.000000 |
| max | 90.000000 | 9999.000000 | 99999.00000 | 4608.000000 | 99999.000000 | 18656.300000 | 52.000000 |
We see major skew in these fields, plus some curious values such as 9999 and 99999 in the dollar-valued columns.
def plot_numeric(column: str) -> None:
fig = px.histogram(train_df, x=column, title=f"{column} distribution")
fig.show()
for col in num_cols.columns:
plot_numeric(col)
Observations:
It is likely that 9999 (and 99999) served either as a filler value in these columns or as an artificial maximum (top-coding) to protect the privacy of outlier data subjects.
(train_df['wage per hour'] == 9999).sum()
1
(train_df['capital gains'] == 99999).sum()
390
(train_df['dividends from stocks'] == 99999).sum()
25
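If we later decide these sentinels should not be taken at face value, one option (our assumption; the data dictionary does not say) is to mask them as missing before any transformation. A toy sketch:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; 99999 mimics the suspected top-code.
df = pd.DataFrame({"capital gains": [0, 5178, 99999, 300]})

# Hypothetical per-column sentinel values to mask as missing.
sentinels = {"capital gains": 99999, "wage per hour": 9999}

for col, val in sentinels.items():
    if col in df.columns:
        df[col] = df[col].replace(val, np.nan)
```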
These are extreme outlier values for these fields (based on the percentiles above). We also saw that a very high percentage of values fall exactly at 0, so let's re-plot with the zeros excluded:
for col in num_cols.columns:
fig = px.histogram(train_df.loc[train_df[col] > 0], x=col)
fig.show()
This allows us to see the distribution better.
Proposal
In the future, we can apply log-transformation, scaling to these fields if the models require. We can alternatively consider binning to create categorical fields from these variables too (based on uniform or percentile bins).
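As a concrete sketch of that proposal (toy series; the bin count and labels are illustrative choices):

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 0, 100, 500, 6418, 99999], name="capital gains")

# log1p keeps the heavy mass at zero finite while compressing the long tail.
log_gains = np.log1p(s)

# Alternatively, percentile bins over the positive values only,
# keeping zero as its own implicit category.
positive = s[s > 0]
bins = pd.qcut(positive, q=2, labels=["low", "high"])
```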
Let's now consider the categorical fields.
pyo.init_notebook_mode()
TARGET = 'income'
def plot_categories(column: str) -> None:
"""
Plot top-20 value counts.
"""
counts = train_df[column].value_counts(normalize=True).head(20)
fig = px.bar(counts, title=f"{column} distribution")
fig.show()
plot_categories(TARGET)
We have a highly imbalanced distribution, with only ~6% of training samples in the 50K+ category. This will likely need addressing: classification models will be affected, and in general we expect worse performance on this minority class.
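Two common mitigations are class weighting and resampling. A sketch of inverse-frequency class weights on toy labels (~6% positive, as in our data), mirroring scikit-learn's `class_weight="balanced"` heuristic `n_samples / (n_classes * n_c)`:

```python
import pandas as pd

# Toy labels: 94 majority, 6 minority samples.
y = pd.Series(["- 50000."] * 94 + ["50000+."] * 6)

# Inverse-frequency weights: the rare class gets proportionally more weight.
counts = y.value_counts()
class_weight = {cls: len(y) / (len(counts) * n) for cls, n in counts.items()}
# Many classifiers accept such a dict directly via a class_weight parameter.
```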
Let's see the rest of the columns:
pyo.init_notebook_mode()
for col in train_df.select_dtypes("object"):
plot_categories(col)
Most columns are similarly imbalanced, with many unique values and long tails.
train_df.select_dtypes("object").nunique()
class of worker 9 detailed industry recode 52 detailed occupation recode 47 education 17 enroll in edu inst last wk 3 marital stat 7 major industry code 24 major occupation code 15 race 5 hispanic origin 9 sex 2 member of a labor union 3 reason for unemployment 6 full or part time employment stat 8 tax filer stat 6 region of previous residence 6 state of previous residence 51 detailed household and family stat 38 detailed household summary in household 8 migration code-change in msa 10 migration code-change in reg 9 migration code-move within reg 10 live in this house 1 year ago 3 migration prev res in sunbelt 4 num persons worked for employer 7 family members under 18 5 country of birth mother 43 country of birth self 43 country of birth father 43 citizenship 5 own business or self employed 3 fill inc questionnaire for veteran's admin 3 veterans benefits 3 year 2 income 2 dtype: int64
If all categorical columns are dummy-encoded we get well over 450 columns (after dropping one reference level per column):
obj_nunique = train_df.select_dtypes("object").nunique()
obj_nunique.sum() - len(obj_nunique)
476
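A dummy-encoding sketch on a toy frame; `drop_first=True` drops one reference level per column, which is where the subtraction above comes from:

```python
import pandas as pd

df = pd.DataFrame({"sex": ["Female", "Male", "Female"],
                   "race": ["White", "Black", "White"]})

# Each column contributes (n_unique - 1) indicator columns.
encoded = pd.get_dummies(df, drop_first=True)
```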
In terms of encoding and processing:
Proposal:
We see there is only a single column with missing values, for <0.5% of the rows:
train_df.isna().mean()
age 0.00000 class of worker 0.00000 detailed industry recode 0.00000 detailed occupation recode 0.00000 education 0.00000 wage per hour 0.00000 enroll in edu inst last wk 0.00000 marital stat 0.00000 major industry code 0.00000 major occupation code 0.00000 race 0.00000 hispanic origin 0.00438 sex 0.00000 member of a labor union 0.00000 reason for unemployment 0.00000 full or part time employment stat 0.00000 capital gains 0.00000 capital losses 0.00000 dividends from stocks 0.00000 tax filer stat 0.00000 region of previous residence 0.00000 state of previous residence 0.00000 detailed household and family stat 0.00000 detailed household summary in household 0.00000 instance weight 0.00000 migration code-change in msa 0.00000 migration code-change in reg 0.00000 migration code-move within reg 0.00000 live in this house 1 year ago 0.00000 migration prev res in sunbelt 0.00000 num persons worked for employer 0.00000 family members under 18 0.00000 country of birth mother 0.00000 country of birth self 0.00000 country of birth father 0.00000 citizenship 0.00000 own business or self employed 0.00000 fill inc questionnaire for veteran's admin 0.00000 veterans benefits 0.00000 weeks worked in year 0.00000 year 0.00000 income 0.00000 dtype: float64
train_df.isna().sum().sum()
874
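As a quick cross-check, the 874 missing cells over the 199,523 rows reproduce the 0.438% rate shown for 'hispanic origin' above (both numbers taken from the cells in this notebook):

```python
# 874 missing cells / 199,523 rows should match the 0.00438 rate above.
missing_rate = 874 / 199523
print(round(missing_rate, 5))  # -> 0.00438
```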
From the data glossary provided:
Hispanic Origin: Persons of Hispanic origin in this file are determined on the basis of a question asking if the person is Spanish, Hispanic, or Latino. If the response is "yes," a follow-up question determines a specific ethnic origin, asking to select their (the person's) origin from a "flash card" listing. The flash-card selections are Mexican, Mexican-American, Chicano, Puerto Rican, Cuban, Cuban American, or some other Spanish, Hispanic, or Latino group.
If these values are missing at random, dropping them would likely make little difference to the modelling.
train_df.loc[train_df.isnull().any(axis=1), "race"].value_counts(normalize=True)
race White 0.843249 Black 0.107551 Amer Indian Aleut or Eskimo 0.017162 Other 0.016018 Asian or Pacific Islander 0.016018 Name: proportion, dtype: float64
train_df['race'].value_counts(normalize=True)
race White 0.838826 Black 0.102319 Asian or Pacific Islander 0.029245 Other 0.018329 Amer Indian Aleut or Eskimo 0.011282 Name: proportion, dtype: float64
train_df['hispanic origin'].value_counts(normalize=True)
hispanic origin All other 0.865381 Mexican-American 0.040670 Mexican (Mexicano) 0.036416 Central or South American 0.019607 Puerto Rican 0.016678 Other Spanish 0.012510 Cuban 0.005668 Do not know 0.001540 Chicano 0.001530 Name: proportion, dtype: float64
train_df.loc[train_df.isnull().any(axis=1), "income"].value_counts(normalize=True)
income - 50000. 0.934783 50000+. 0.065217 Name: proportion, dtype: float64
The missingness does not appear to be skewed with respect to our target or the 'race' column.
Proposal: since the 'hispanic origin' column already has a 'Do not know' value, we can fill the missing entries with that.
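A minimal sketch of that fill, shown on an illustrative frame (in the notebook this would operate on train_df directly):

```python
import pandas as pd

# Illustrative frame; in the notebook this would be train_df.
df = pd.DataFrame({"hispanic origin": ["All other", None, "Cuban"]})

# Reuse the category the column already uses for unknown responses.
df["hispanic origin"] = df["hispanic origin"].fillna("Do not know")
print(df["hispanic origin"].isna().sum())  # -> 0
```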
Let's see how many duplicate rows we have:
train_df.duplicated().sum()
3229
This is already significant, but if we look at the data without our mystery column:
train_df.drop(columns="instance weight").duplicated().sum()
46627
We can see that without the mystery column, we get approximately the number of duplicate instances mentioned in the data dictionary.
Number of instances data = 199523
Duplicate or conflicting instances : 46716
Number of instances in test = 99762
Duplicate or conflicting instances : 20936
Class probabilities for income-projected.test file
It is unclear where these come from, and there is no ID to identify rows (unsurprisingly). It's also a very large number of rows to drop from our data, and some of these duplicates could be legitimate (two people with the same recorded attributes).
without_duplicates = train_df.drop_duplicates(subset=train_df.drop(columns="instance weight").columns)
without_duplicates.shape
(152896, 42)
Note: the Census does put a fair effort into deduplication. https://www.census.gov/newsroom/blogs/random-samplings/2021/04/how_we_unduplicated.html
There are several reasons for duplicates in a census:
- We receive more than one response for an address.
- People are counted in more than one place because of potentially complex living situations.
- There might be an issue with the address: a housing unit is on our address list more than once, or census materials are misdelivered.
We use a special algorithm to resolve the first situation and a series of steps to resolve the second and third.
We'd like to get a sense of which columns are most strongly related to the target. We can do this with standard statistics and tests (chi-squared or F-scores). We leave feature selection to the preprocessing notebook and explore a few promising predictor columns here.
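As a preview of what that selection step could look like, here is a sketch on synthetic data (sklearn's SelectKBest is an assumption about the eventual tooling, and chi-squared requires non-negative inputs, which dummy- or count-encoded features satisfy):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for dummy-encoded census features and a binary target.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5))
y = (X[:, 0] + rng.integers(0, 2, size=100) > 1).astype(int)

# Keep the k features with the highest chi-squared score against the target.
selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support().sum())  # -> 2
```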
There are only weak correlations among the numeric features:
fig = px.imshow(
train_df.select_dtypes('number').corr(),
text_auto=True,
color_continuous_scale='RdBu_r',
zmin=-1,
zmax=1
)
fig.show()
Between our target and the numeric features:
pyo.init_notebook_mode()
for col in train_df.select_dtypes('number'):
fig = px.box(train_df, x=col, color="income")
fig.show()
With the zero values filtered out and a log1p transform applied, we get a better visual:
pyo.init_notebook_mode()
for col in train_df.select_dtypes('number'):
    filtered = train_df[train_df[col] > 0].copy()  # .copy() avoids SettingWithCopyWarning; note many data points are filtered out
    filtered[col] = np.log1p(filtered[col])
    fig = px.box(filtered, x=col, color="income")
    fig.show()
Observations:
Finally, looking at our categorical fields, we can check pivot tables and chi2 scores:
from scipy.stats import chi2_contingency

TARGET = "income"  # name of the target column

# run on a sample; chi-squared tests give more false positives with very large samples
sample = train_df.sample(1000)
results = []
for col in train_df.select_dtypes('object').drop(columns=TARGET):
    contingency_table = sample.groupby([col, TARGET])[TARGET].count().unstack().fillna(0)
    chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)
    results.append(
        {
            'col': col,
            'chi2': chi2_stat,
            'p_value': p_value,
        }
    )
chi2_results = pd.DataFrame(results)
# significant with 0.1% cutoff to account for multiple testing
chi2_results[chi2_results["p_value"] < 0.001]
| | col | chi2 | p_value |
|---|---|---|---|
| 0 | class of worker | 91.818768 | 1.986259e-16 |
| 1 | detailed industry recode | 128.553925 | 5.663176e-10 |
| 2 | detailed occupation recode | 236.852983 | 3.061579e-28 |
| 3 | education | 178.368508 | 1.788210e-29 |
| 5 | marital stat | 57.480644 | 1.460121e-10 |
| 6 | major industry code | 96.551392 | 1.168983e-11 |
| 7 | major occupation code | 145.182768 | 1.920062e-24 |
| 10 | sex | 40.598459 | 1.869541e-10 |
| 13 | full or part time employment stat | 37.576157 | 3.647394e-06 |
| 14 | tax filer stat | 65.917157 | 7.231013e-13 |
| 17 | detailed household and family stat | 78.833498 | 1.263823e-08 |
| 18 | detailed household summary in household | 74.661860 | 1.094412e-14 |
| 24 | num persons worked for employer | 76.649295 | 1.755487e-14 |
| 25 | family members under 18 | 30.149694 | 4.562721e-06 |
| 32 | veterans benefits | 27.497599 | 1.068986e-06 |
We can see that these are plausible predictors: industry, class of worker, employment status, etc. are all likely to affect income. We expect them to show up in the modelling as well.
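Since chi-squared p-values shrink as the sample grows, an effect-size measure such as Cramér's V can complement the test when ranking columns. A sketch on a toy table (the helper below is illustrative, not part of the notebook's code):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(contingency: np.ndarray) -> float:
    """Cramer's V: sqrt(chi2 / (n * (min(rows, cols) - 1))), in [0, 1]."""
    chi2_stat, _, _, _ = chi2_contingency(contingency)
    n = contingency.sum()
    r, c = contingency.shape
    return float(np.sqrt(chi2_stat / (n * (min(r, c) - 1))))

# A perfectly associated 2x2 table scores near 1 (Yates' continuity
# correction, applied by default to 2x2 tables, keeps it just below).
table = np.array([[50, 0], [0, 50]])
print(round(cramers_v(table), 2))  # -> 0.98
```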